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We study the two-dimensional classical XY model by the large-scale Monte Carlo simulation of the Swendsen-Wang 
multi-cluster algorithm using multiple GPUs on the open science supercomputer TSUBAME 2.0. Simulating systems up 
to the linear system size L = 65536, we investigate the Kosterlitz-Thouless (KT) transition. Using the generalized version 
of the probability-changing cluster algorithm based on the helicity modulus, we locate the KT transition temperature in 
a self-adapted way. The obtained inverse KT temperature /?kt is 1.11996(6). We estimate the exponent to specify the 
multiplicative logarithmic correction, -2r, and precisely reproduce the theoretical prediction -2r = 1/8. 
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The two-dimensional (2D) classical XY model (planar ro- 
tator model) shows a unique phase transition, the Kosterlitz- 
Thouless (KT) transition.''^' It does not have true long-range 
order, but the correlation function decays as a power of 
the distance at all the temperatures below the KT transition 
point. Compared to the algebraic divergence of the correlation 
length ^(T) for the second-order phase transition, the correla- 
tion length diverges much faster as 



^{T) cc exp(c/ Vf) = A exp(c/ Vf) 



(1) 



with t - (T - Tk:i)/Tk:i for the KT transition. Thus, numer- 
ical studies on large systems are needed. Moreover, the dif- 
ficulties in Monte Carlo simulations come from logarithmic 
corrections that are predicted to be present.^' ^' The magnetic 
susceptibility scales as 



;t-ocL2-''(lnL) 



-2r 



(2) 



with = 1/4, and the theoretical prediction of -2r is 1/8.^' 
There were discrepancies in the estimate of Tkt of two most 
precise results;'*-''' to resolve this discrepancy Hasenbusch^' 
made extensive Monte Carlo simulations on lattices up to the 
linear system size L - 2048. There has been a puzzle in the 
estimate of the exponent to describe the logarithmic correc- 
tion, -2r. The history of the estimates of the exponent -2r is 
listed in the paper by Kenna.^' We note that the critical tem- 
perature and the logarithmic-correction exponent were also 
calculated by the high-temperature expansion of the magnetic 
susceptibility to orders f3^^ for the classical XY model. ^' 

The Monte Carlo simulation has served as a standard nu- 
merical tool in the field of many body problems in physics. 
However, the Metropolis-type single-spin-flip algorithm^* of- 
ten suffers from the problem of slow dynamics or critical 
slowing down. To overcome this difficulty, a multi-cluster 
flip algorithm was proposed by Swendsen and Wang (S W). 
Wolff"' proposed another type of cluster algorithm, that is, a 
single-cluster algorithm. 

Tomita and Okabe'^ '^' developed a cluster algorithm, 
which is called the probability-changing cluster (PCC) algo- 
rithm, of locating the critical point automatically. It is an ex- 



tension of the SW algorithm,'"' but one changes the probabil- 
ity of cluster update (essentially, the temperature) during the 
Monte Carlo process. In the original paper, '^' Tomita and Ok- 
abe investigated the second-order phase transition of discrete 
spin models, such as the ^-state Potts model, but the PCC al- 
gorithm was extended to the problem of the vector order pa- 
rameter'*' with the use of Wolff^'s embedded cluster formal- 
ism.'" 

High performance computing triggers advances in science. 
The use of general purpose computing on graphics processing 
unit (GPU) is a recent hot topic in computer science. Dras- 
tic reduction of processing times can be realized in scientific 
computations. Using the common unified device architecture 
(CUDA) released by NVIDIA, it is now easy to implement 
algorithms on GPU using standard C or C++ language with 
CUDA specific extension. A parallel computing is performed 
using many threads in a single GPU. Moreover, larger-scale 
problems beyond the capacity of video memory on a single 
GPU can be accommodated by multiple GPUs. Multiple GPU 
computing requires GPU-level parallelization. A large-scale 
open science supercomputer TSUBAME 2.0 is available at 
the Tokyo Institute of Technology. TSUBAME 2.0 consists of 
4224 NVIDIA Tesla M2050 GPUs as a total, and the theoret- 
ical peak performance reaches 2.4 PFLOPS. 

A successful application of GPU computing to the 
Metropolis-type single-spin-flip algorithm was proposed by 
Preis et a/. '^•'^' The GPU-based calculation of the multispin 
coding of the Monte Carlo simulation was reported,'^' where 
the multiple GPU calculation was argued. High performance 
computing using GPUs is highly desirable for Monte Carlo 
simulations with cluster flip algorithms. Recently, some at- 
tempts have been reported along this line on a single GPU. 
Weigel"*' has studied parallelization of cluster labeling and 
cluster update algorithms for calculations with CUDA. He re- 
alized the SW multi-cluster algorithm by using the combina- 
tion of self-labeling algorithm and label relaxation algorithm 
or hierarchical sewing algorithm. The present authors'^' have 
proposed the GPU calculation for the SW multi-cluster algo- 
rithm by using the two connected component labeling algo- 
rithms, the algorithm by Hawick et al.~^^ and that by Kalen- 
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speed for the 2D q - 2 Potts model on NVIDIA GeForce 
GTX580 was 12.4 times as fast as that on a current CPU core, 
Intel Xeon CPU W3680. More recently, we have extended 
the GPU-based calculation on a single GPU to multiple GPU 
computation.^^' To deal with multiple GPUs, we use the mes- 
sage passing interface (MPI) library for communication. We 
employ a two-stage process of cluster labeling; that is, the 
cluster labeling within each GPU and the inter-GPU label- 
ing. We have tested the performance for the 2D q'-state Potts 
model. By using 256 GPUs we have treated systems up to 
L = 65536. 

In this paper, we study the 2D classical XY model by the 
use of large scale computations with multiple GPUs on the 
system of TSUBAME 2.0. We use a generalized version of the 
PCC algorithm based on the helicity modulus. We locate the 
KT temperature with the self-adapted approach of the PCC 
algorithm. We pay attention to the logarithmic corrections of 
the susceptibility; we study the exponent -2r to specify the 
logarithmic correction. 

The Hamiltonian of the classical XY model is given by 



-J^Si- Sj = ^ C0S(6»; - Oj), 



(3) 



<1J> 



</J> 



where Si is a unit vector with two real components. The sum 
is over nearest-neighbor sites of square lattice (L x L) with 
periodic boundary conditions. 

The helicity modulus gives the reaction of the system un- 
der a torsion. The Kosterlitz renormalization-group equations 
lead to the universal jump of the helicity modulus;^' that is, 
from the value (2/n)T}c[ to at T^t in the thermodynamic 
limit. The helicity modulus T can be expressed as^"^'^^-* 



where 



r^{e)- —is'-), 



-i 2 cos{6i-ej). 



<IJ>y 



12 Yj 



sin(0; - Oj), 



(4) 
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(6) 
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and the sum is over all links in one direction. We note that 
the definition of T in Ref. 6 includes jS; that is, Tnasenbusch - 
jST. Hasenbusch^' calculated the tiny correction to the critical 
value of j6T at Tkt due to winding field for periodic boundary 
conditions; the calculated critical value is 0.636508, which 
differs from 2/7r = 0.636620 by 0.02%. 

In the original formulation of the PCC algorithm,'-' one in- 
creases or decreases temperature, depending on the observa- 
tion whether clusters are percolating or not. The critical tem- 
perature is determined by using the finite-size scaling (FSS) 
property of the existence probability of percolation. Tomita 
and Okabe also presented a generahzed scheme of the PCC 
algorithm-^' based on the FSS property of the ratio of corre- 
lation functions with different distances. In the present paper, 
we use the helicity modulus as a basis of the PCC algorithm. 
The reason to use the helicity modulus is that for the mul- 
tiple GPU computation there are difficulties in the check of 
percolation and the calculation of correlation function with 
long distance. These difficulties are due to distributed memo- 
ries in the multiole GPU svstem. The actual orocedure for the 



1.115 



1.11 



1.105 




250 

MCS (X1000) 

Fig. 1. (Color online) Time evolution of /3 with the PCC algorithm for the 
2D classical XY model. The system sizes are 512, 4096, and 32768. 



change of temperature is as follows: If the helicity modulus 
y6T is smaller (larger) than 0.636508, we increase (decrease) 
y8byAj6(> 0). 

We have made simulations for the classical XY model on 
the square lattice with the system sizes L - 64, 128, 256, 512, 
1024, 2048, 4096, 8192, 16384, 32768, and 65536. Actually, 
we use a CPU, Xeon W3680, for L up to 256, and a single 
GPU, NVIDIA GeForce GTX580, for 512 < L < 4096. We 
use a multiple GPU system, TSUBAME 2.0, for the system 
L > 8192. For multiple GPUs, we assign the sub-lattice of 
size 4096 x 4096 for each GPU. That is, the systems with L - 
8192, 16384, 32768, and 65536 are realized by 4, 16, 64, and 
256 GPUs. 

We discard 60000 Monte Carlo steps (MCS) before mak- 
ing a measurement. To measure the helicity modulus we take 
1000 MCS. We choose ^|3 as 0.0005. Throughout this paper 
we use the unit of / = 1 . First, we plot the time evolution of 
jS in Fig. 1. The time steps are given in units of 1000 MCS. 
For illustration, the data for L = 512, 4096, and 32768 are 
shown. From Fig. 1, we see that the temperature is oscillating 
around the average value. The width of fluctuation becomes 
smaller as the system size increases because of the effect of 
self-averaging. 

We take an average of inverse temperature over long steps; 
we denote it by /3KTiL) for each size. We tabulate /SKjiQ in 
Table I. The numbers of measured steps in units of 10^ MCS 
for each size are also given in Table I. The numbers of mea- 
sured MCS range from 4 x 10^ to 4 x 10'' depending on L. The 
uncertainties of measured values, denoted in the parentheses, 
are estimated by the short-time average. 

Then, we consider the size dependence of /3ki:{L)- We use 
the FSS analysis based on the KT form of the correlation 
length, Eq. (1). Using the PCC algorithm, we obtain the in- 
verse temperature /3kt{L) such that the helicity modulus fTT 
is 0.636508. It means that ^/L(= a) is constant for each size. 
Then, using the FSS form of T, that is, T - Y(^/L), we have 
the relation 



(7) 



(ln/7L)2' 

where b — a /A. We plot f^KTiL) as a function of with 
I - In bL for the best fitted parameters in Fig. 2. The error 
bars, which are very small, are shown there. We try several 
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Table I. Estimates of /3KT(i), >L and ^i^iDIL for the 2D classical XY 
model using the PCC algorithm. The numbers of measured steps in units of 
10^ MCS are given. 
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Fig. 2. (Color online) Plot of /?KT(i) of the 2D classical XY model for L = 
64, 128, 256, 512, 1024, 2048, 4096, 8192, 16536, 32768, and 65536, where 
/ = InbL. The best fitted line, 1.11996 - 0.826 X T^, is shown with b = 3.6. 



estimates; that is, the minimum size L for the check of x~ 
values is taken as 64, 128, 256, and 512. From smaller val- 
ues, we actually use the data of the minimum size L being 
128 and 256. Our estimate of j6kt is 1.11996(6); the numbers 
in the parentheses denote the uncertainty in the last digits. 
This value is consistent with the estimates of recent studies; 
1 . 1 1 99( 1 ) by the Monte Cai'lo simulation,''' and 1 . 1 200( 1 ) by 
the high-temperature expansion.^' 

Next consider the correlation length of second moment 

^2nd(i); 

\x{Q)lx{2nlL)-\]'l^ 



with 



ftnd(i) 



- N 



2 sin(;r/L) 



2 ^('■y'' 



(8) 



(9) 



In Ref. 6, ^2nd(L)/L is calculated as 0.7506912 for L ^ oo 
at the KT temperature Tkt- We tabulate ^2nd{L)/L at TnjiL) 
in Table I, and plot them in Fig. 3, where the horizontal axis 
is chosen as the same as Fig. 2. We see that ^2nA{I-)IL ap- 
proaches 0.750 ■ ■ ■ rapidly even for small L. We also make 
simulations aX j3 - 1.1199 to confirm the consistency with 



Fig. 3. (Color online) Plot of ftnd(L)/Z. at rKT(i) for L = 64, 128, 256, 
512, 1024, 2048, 4096, 8192, 16536, 32768, and 65536. The horizontal axis is 
the same as that of Fig. 2. The data with the PCC algorithm ai'e compared with 
those at the fixed temperature = 1.1 199. The calculated value for L = oo by 
Hasenbusch,*' 0.75069 ■ • • , is given by a dotted fine for convenience. 



the calculation by Hasenbusch.^' For comparison we plot 
^2nd(L)/L with fixing jS = 1.1199 in Fig. 3, which approaches 
0.750 ■ ■ ■ slowly with the increase of size. 

We now turn to the logarithmic-correction exponent -2r, 
given in Eq. (2). It comes from the multiplicative logarithmic 
corrections in terms of correlation length, such that 



-2r 



(10) 



Since we measure (m^)/, at the temperature TjcriL) with ^/L - 
a for each size, we have 



{m^)LL^'^ (InaL) 



-2r 



(11) 



The measured data for {m^)i are tabulated in Table I. If we 
take logarithm of both sides of Eq. ( 1 1 ), we have 

In (<m2)z, L'/'^) = const -2rln(lnaL). (12) 

We plot \n{{m^)LL}l'^) as a function of In(lnflL) with the best 
fitted parameter in Fig. 4, where a is chosen as 16.0. Checking 
the;^^^ values, we determine —2r as 0.128(8). In Fig. 4, we also 
give a plot of \vi{{m^)iL}^'^) as a function of In(lnL); that is, 
a - 1 . If we use only the data of a = 1 for smaller lattices 
(say, 64 < L < 2048), the best fitted slope, which indicates 
-2r, in Fig. 4 becomes 0.07; whereas the slope becomes 0. 1 1 
when using only the data for larger lattices (say, 2048 < L < 
65536). Thus, for large enough L, up to 65536, we can say that 
the slope, -2r, approaches the theoretical value 1/8 (=0.125) 
irrespective of the choice of a. 

We observe that the slope, which indicates -2r, is smaller 
for small L if we do not consider a. For large enough L, up 
to 65536, we clearly see that the slope, -2r, approaches the 
theoretical value 1/8 irrespective of the choice of a. We may 
conclude that the puzzle has been finally solved. 

We note that the thermal average of the physical quantities, 
such as {m^)i, with the PCC algorithm is consistent with the 
T-fix calculation within the error bars. In the PCC algorithm, 
T is changing, but the thermal average of physical quantity 
coincides with the T-fix calculation because the fluctuation 
width is small enough. 
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Fig. 4. (Color online) Plot of ln«j7r>z.L"'*) as a function of ln(ln(aL)) at 
Tkt{L) for L = 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16536, 32768, 
and 65536. The best fitted line with a = 16.0, -0.262 + 0.128 X ln(ln(aL)), is 
given. We also plot the case of o = 1.0 for comparison 



the PCC algorithm is efficient to locate the critical tempera- 
ture, not only for the second-order transition but also for the 
KT transition. We finally emphasize that GPU computation 
is very effective when a large-size calculation is needed, for 
example, for systems where logarithmic behavior is essential. 
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